Abstract


Autonomous estimation of the altitude of an Unmanned Aerial Vehicle (UAV) is extremely
important for flight maneuvers such as landing and steady flight. Vision-based techniques
for solving this problem have been underutilized. In this thesis, we propose a new
algorithm to estimate the altitude of a UAV from top-down aerial images taken by a single
onboard camera. We use a semi-supervised machine learning approach to solve the problem.
The basic idea of our technique is to learn a mapping from the texture information
contained in an image to a possible altitude value. From a corpus of unlabeled images, we
learn an overcomplete sparse basis set that captures the texture variations. This is followed
by regression of this basis set against a training set of altitudes. Finally, a spatio-temporal
Markov Random Field is modeled over the altitudes in test images, and the Maximum A
Posteriori (MAP) estimate of the altitude is obtained by solving a quadratic optimization
problem with L1 regularization. The method is evaluated in a laboratory setting with
a real helicopter and is found to provide promising results with sufficiently fast turnaround
time.
1 Introduction

Unmanned Aerial Vehicles (UAVs) have been an active area of research in recent years.
UAVs have been found to be an ideal platform for a number of civilian and military tasks
such as visual surveillance, inspection, firefighting, policing civil disturbances, and reconnaissance
support in natural disasters. Their ability to fly at low speeds, hover, fly laterally,
and perform maneuvers in narrow spaces makes them well suited to these tasks. One of the most
important tasks in achieving UAV autonomy is autonomous navigation, which requires good
altitude estimation techniques. The main thrust in mini-UAV design these days has been
on optimizing and miniaturizing the hardware and putting multiple functionalities into
the same device. Onboard cameras are indispensable components of a UAV, enabling
environment monitoring, tracking, etc. Compared to other sensors (e.g., laser), video
cameras are light and consume little power. In this thesis, we investigate the idea of
estimating the altitude of a UAV from the images taken by a single onboard camera
using machine learning techniques.

Vision-based control of an autonomous helicopter has been investigated quite thoroughly
in previous years, and different camera systems and arrangements have been tried. A
downward-looking camera with a standard lens has been investigated in [11], [4], [15], but
in these approaches the state estimation is relative to the specifics of a given landing pad.
In [9], a multi-view geometry based approach to build a digital map of the ground is
suggested. The authors use aerial image sequences taken from a side-looking helicopter camera,
with the assumption that there are uniquely recognizable features in the vicinity of the
UAV to correlate the images in the sequence. An application of omnidirectional cameras
to vision-based navigation is described in [14], but the environment in which it is
used seems very restrictive. A reinforcement learning strategy for performing various flight
maneuvers has been investigated in [10], but it does not use any vision-based techniques.

To the best of our knowledge, this is the first time that altitude estimation for a UAV
has been studied in its own right and a machine learning framework has been proposed for it.
The motivation for this research stems from recent developments in the area of 3D
reconstruction using monocular cues. In [2], [1] and [3], Saxena et al. propose an algorithm
for building a depth map from a single image. The algorithm uses Markov Random Field
(MRF) based supervised learning to model the depth at each pixel of a given image
as a function of a set of feature vectors computed from those pixels. However, we
found that their method cannot be applied to our problem for the following reasons:
(i) we have top-down aerial views, (ii) there is little structure compared to images taken
on the ground, and (iii) we assume that the ground plane is flat, so only a single altitude
needs to be computed from the entire image. We also assume that the UAV does not
make sudden changes in altitude, so that the altitude varies smoothly from one image to
the next. We incorporate this information into our model to refine
the predicted altitude. To account for issue (ii), we suggest a semi-supervised learning
method for learning a sparse overcomplete basis from a corpus of possible terrain images,
along the lines of the Self-Taught Learning strategy proposed in [12]. Self-taught learning is
a kind of transfer learning based on the assumption that any image consists of a few
basic ingredients such as edges and textures; thus, learning a sparse overcomplete basis over
a random set of images provides a powerful representation for modeling any given image
as a sparse linear combination of these basis vectors. Our approach, however, is not transfer learning,
since we use only aerial images of the terrains over which the UAV will fly. We then perform supervised
regression over this basis using the altitudes available in a training set. Finally, we
introduce a novel spatio-temporal MRF model that relates the altitude of a patch in an
image to the altitudes of other patches in the same image and of patches in images from
earlier time frames. The MRF model is then solved for the Maximum A Posteriori (MAP)
estimate of the altitude.

The rest of the document is organized as follows. In Chapter 2, we motivate the use of
texture-based techniques for altitude estimation and then discuss how the feature vectors
are computed. In Chapter 3, we propose a probability model for the problem and optimization
techniques for solving it. Chapter 4 discusses our experiments, and we conclude in Chapter 5.
2 Feature Vector

Given a video of altitude variations taken with a moving camera of fixed focal length, humans
have little difficulty inferring the altitudes across frames. For example, we
can easily tell whether an image was taken close to the ground or far away, or how large
the relative difference in altitude between two given images is. This is attributed not only
to our prior knowledge about the environment, but also to our ability to use monocular
cues such as texture variations, known object sizes, haze, and focus/defocus in
inference. Texture gradients capture the distribution of edge directions. They are a
valuable source of depth cues and have been used quite effectively for 3D reconstruction
in papers such as [2], [1].

When dealing with aerial images taken from a UAV, we face additional issues
that cannot be adequately captured by texture variations alone. For example, most of
the images are noisy, exhibit a variety of illumination differences, or are blurred
by the motion of the UAV. Moreover, aerial images lack structure compared to images
taken on the ground. In ground images, for example, we can usually assume that there is a
ground plane, that all objects stand on the ground, etc. Aerial images with top-down views, however,
look like random patches, and conventional filters such as autocorrelation filters,
Fourier/wavelet based filters, and texture gradient filters like Nevatia-Babu or Laws' masks
cannot effectively relate the texture variations to the corresponding altitude variations.
Fig. 1 shows a few sample images that we will be working with in this thesis. They were
taken in our laboratory setting, and the altitude at which each image was taken is also
indicated. Note the variation in texture as the altitude increases.

The motivation for our approach stems from recent developments
in sparse coding for compressed sensing [5], where information is encoded using a
sparse overcomplete basis that effectively captures higher-level information in the data,
leading to close to perfect reconstruction. The method was found to be robust to noise
and relatively immune to illumination variations. In sparse coding, only very few vectors
from the basis set are needed to reconstruct a given image patch. Thus, a regression of
this active set against altitudes provides a good representation of the relationship between
altitude variations and texture differences. We would also like to reduce the computational
time for feature extraction without compromising the generality of the
representation, and we felt that conventional approaches using filter banks might not meet this
requirement. For example, in [2], a filter bank of 510 dimensions is suggested, which increases
both the feature extraction time and the altitude prediction time. Fast turnaround time
is a critical aspect of our application.

Figure 1: Sample images of the top-down aerial views from an onboard camera of a UAV
in the laboratory setting. The altitude at which each subimage was taken is also shown.

In [12], an efficient framework for learning such a sparse overcomplete basis is suggested,
which is then used for object classification. Our problem differs from their approach
in that we do not learn the basis from completely random images, but from aerial images of
various terrains. Thus our philosophy is closer to a semi-supervised learning [17] setting,
although we use their model to learn the basis. To demonstrate the generality of our
approach in arbitrary scenarios, we used random aerial images of various terrains from the
internet to build our basis set. A few sample images that we used for this purpose are
shown in Fig. 2.


Figure  2:  Random  aerial  images  downloaded  from  the  internet  for  learning  the  basis  set. 

Given a large corpus I of unlabeled image patches, each patch is vectorized as a
k-dimensional input vector y. The goal of sparse coding is to represent these vectors as
approximate, sparse weighted linear combinations of n basis vectors. That is, for the
input vector $y^{(i)} \in \mathbb{R}^k$,

$$ y^{(i)} \approx \sum_{j=1}^{n} a_j^{(i)} b_j \qquad (1) $$

where $b_1, b_2, \ldots, b_n \in \mathbb{R}^k$ are the basis vectors and $a^{(i)} \in \mathbb{R}^n$ is a sparse vector of coefficients.
Unlike similar methods such as PCA, the basis set $B$ that we use here can be overcomplete
($n > k$) and can represent nonlinear features of $y$. To find the optimal $B$ and $a^{(i)}$'s, we solve
the following optimization problem, as formulated by [12]:

$$
\min_{b,a} \sum_i \Big( \big\| y^{(i)} - \sum_j a_j^{(i)} b_j \big\|_2^2 + \beta \|a^{(i)}\|_1 \Big)
\quad \text{s.t. } \|b_j\|_2 \le 1, \ \forall j \in \{1, \ldots, n\} \qquad (2)
$$

The objective of (2) balances two terms: (i) the first, quadratic term encourages each input
$y^{(i)}$ to be reconstructed well as a weighted linear combination of the basis vectors $b_j$,
with weights given by the activations $a_j^{(i)}$, and (ii) the second term encourages
the activations to have a low L1 norm, i.e., encourages $a^{(i)}$ to be sparse. The optimization
problem is convex in each subset of variables, a and b, separately, but it is not jointly convex. More
specifically, the problem in the activations a is an L1-regularized least squares problem, whereas
the one in the basis b is an L2-constrained least squares problem. The paper [7] provides
algorithms to solve these two sub-problems efficiently. Fig. 3 shows a basis set learned
with this algorithm from random aerial images downloaded from the internet.
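The alternating structure of (2) can be made concrete with a minimal sketch of the basis-update half, holding the activations fixed. This is not the Lagrange-dual method of [7]. As a simpler stand-in, it takes one projected-gradient step on B and then rescales any basis vector whose norm exceeds 1, which enforces $\|b_j\|_2 \le 1$. The function names and the step size are illustrative assumptions.

```python
import math


def reconstruction_error(Y, B, A):
    """First term of (2): sum_i ||y_i - sum_j a_ij * b_j||^2."""
    total = 0.0
    for y, a in zip(Y, A):
        for r in range(len(y)):
            approx = sum(a[j] * B[j][r] for j in range(len(B)))
            total += (y[r] - approx) ** 2
    return total


def project_unit_ball(b):
    """Rescale b onto the constraint set ||b||_2 <= 1 (no change if already inside)."""
    norm = math.sqrt(sum(v * v for v in b))
    return [v / norm for v in b] if norm > 1.0 else list(b)


def basis_step(Y, B, A, step=0.05):
    """One projected-gradient step on the basis B with the activations A held fixed.

    Y: m patches, each a list of k floats
    B: n basis vectors, each a list of k floats
    A: m activation vectors, each a list of n floats
    """
    n, k = len(B), len(B[0])
    grad = [[0.0] * k for _ in range(n)]
    for y, a in zip(Y, A):
        # residual of this patch: y - sum_j a_j b_j
        resid = [y[r] - sum(a[j] * B[j][r] for j in range(n)) for r in range(k)]
        for j in range(n):
            for r in range(k):
                # gradient of ||resid||^2 with respect to b_j is -2 * a_j * resid
                grad[j][r] -= 2.0 * a[j] * resid[r]
    return [project_unit_ball([B[j][r] - step * grad[j][r] for r in range(k)])
            for j in range(n)]
```

With `Y = [[1, 0], [0, 1], [1, 1]]`, `B = [[0.5, 0.1], [0.2, 0.6]]` and `A = [[1, 0], [0, 1], [1, 1]]`, one step lowers the reconstruction term from 0.64 to 0.432. In a full learner, this step would alternate with an L1-regularized update of the activations.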

Once a sparse basis set $B \in \mathbb{R}^{k \times n}$ is obtained, we can construct a feature vector $f$ for
a given vectorized image patch $p$ of dimension $k$ by computing the activations on the basis
that produce this patch. That is,

$$ \min_f \Big\| p - \sum_j f_j b_j \Big\|_2^2 + \beta \|f\|_1 \qquad (3) $$

The feature vectors $f$ from all the patches $p$ in a given image are stacked up to form the
feature vector set $F$ that is used in the following sections.
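Computing the features in (3) amounts to solving one L1-regularized least squares problem per patch. The sketch below solves it with ISTA (iterative shrinkage-thresholding), a standard solver that differs from the feature-sign search of [7]. The function names, the iteration count and the step-size bound are assumptions made for illustration. Plain lists stand in for vectors, so the example has no dependencies.

```python
def soft_threshold(v, t):
    """Proximal operator of t*|v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def sparse_code(p, B, beta, iters=500):
    """Approximately solve min_f ||p - sum_j f_j b_j||^2 + beta*||f||_1 by ISTA.

    p: patch as a list of k floats; B: list of n basis vectors of length k.
    """
    n, k = len(B), len(p)
    # The gradient 2*B^T(Bf - p) is Lipschitz with constant 2*lambda_max(B^T B);
    # 2*trace(B^T B) = 2*sum_j ||b_j||^2 is a cheap, always valid upper bound.
    L = 2.0 * sum(sum(v * v for v in b) for b in B)
    step = 1.0 / L
    f = [0.0] * n
    for _ in range(iters):
        resid = [sum(f[j] * B[j][r] for j in range(n)) - p[r] for r in range(k)]
        grad = [2.0 * sum(B[j][r] * resid[r] for r in range(k)) for j in range(n)]
        # gradient step on the quadratic term, then shrinkage for the L1 term
        f = [soft_threshold(f[j] - step * grad[j], step * beta) for j in range(n)]
    return f
```

An orthonormal basis makes a convenient check, because the solution then has the closed form $f_j = \mathrm{sign}(p_j)\max(|p_j| - \beta/2, 0)$. For $p = (3, 0.2)$ and $\beta = 1$ this gives $f = (2.5, 0)$.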
Figure 3: 350 basis vectors, each of size 16 × 16, learned using 50000 patches from random
internet aerial images.
3 Inference Model


Now that we have a comprehensive representation of the texture of an image as a linear
combination of the basis, we deploy a supervised learning algorithm modeled on a spatio-temporal
Gaussian Markov Random Field (MRF) [13] to estimate the posterior distribution
of the altitude for every pixel block in the image. We model the posterior distribution
of the altitude $d$ given the feature vector set $F$ and parameters $\sigma$ and $\theta$ as:


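$$ P(d \mid F; \sigma, \theta) = \frac{1}{Z} \exp\big(-E_{\sigma,\theta}(d, F)\big) \qquad (4) $$

where

$$
E_{\sigma,\theta}(d, F) = \sum_{i=1}^{n} \frac{(d_i^t - F_i\theta)^2}{\sigma_a^2}
+ \sum_{j=1}^{T} \sum_{i=1}^{n} \frac{(d_i^t - d_i^{t-j})^2}{\sigma_j^2}
+ \beta \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \big| d_i^t - d_j^t \big| \qquad (5)
$$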

Here, $Z$ is a normalization constant, $E_{\sigma,\theta}(d, F)$ defines the Gibbs energy function, and
$F_i$ is the feature vector at pixel block $i$ of the image, computed using (3). The first term in
(5) models the raw altitude at pixel block $i$ in terms of the feature vector $F_i$ through the
regressor $\theta$. As is apparent, we use a linear relationship between $F_i$ and $d$. Linear least
squares regression over the training set is used to find the vector $\theta$.
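The following minimal sketch fits $\theta$ by linear least squares. It assumes that each training pixel block contributes one feature vector $F_i$ (a row of `F`) and one ground-truth altitude $d_i$. It solves the normal equations $F^T F\,\theta = F^T d$ by Gauss-Jordan elimination. The optional ridge term `lam` is an assumption, not part of the thesis. It keeps the system solvable when features are collinear.

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system A x = b."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_theta(F, d, lam=0.0):
    """theta minimizing sum_i (d_i - F_i . theta)^2 + lam * ||theta||^2."""
    m = len(F[0])
    FtF = [[sum(row[a] * row[b] for row in F) + (lam if a == b else 0.0)
            for b in range(m)] for a in range(m)]
    Ftd = [sum(row[a] * di for row, di in zip(F, d)) for a in range(m)]
    return solve_linear(FtF, Ftd)
```

For example, `fit_theta([[1, 0], [0, 1], [1, 1]], [2, -1, 1])` recovers $\theta \approx (2, -1)$, because those altitudes were generated exactly by that regressor.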

We assume that the altitude of the UAV changes smoothly rather than abruptly, so
the altitudes predicted from one image frame to the next should not deviate drastically.
This is formulated by the second term in (5), which
constrains the altitude at pixel block $i$ at time $t$ to be close to that of pixel block $i$ in the
image frame at time $t - j$. The third term in (5) captures the relationship between
the altitude predicted at pixel block $i$ and all other blocks in the same image. Since we are
working with top-down aerial images and assume that the ground is
flat, there is only a single altitude value for the whole image. Thus we
constrain the altitudes predicted from different pixel blocks of one frame to be as equal as
possible; that is, if $d_i$ and $d_j$ represent the altitudes at any two pixel blocks in the same
frame, then $d_i - d_j$ should be as close to zero as possible. Fig. 4 shows a schema of the
MRF setup we assume in our computations.
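The three terms of (5) can be evaluated directly. This sketch computes $E_{\sigma,\theta}(d, F)$ for one frame. It assumes that the raw predictions $F_i\theta$ and the altitudes of the previous $T$ frames are already available. The argument names are illustrative. As written in (5), the double sum counts each pair $(i, j)$ in both orders.

```python
def gibbs_energy(d, raw, prev, sigma_a, sigmas, beta):
    """Energy E(d, F) of (5) for one frame.

    d:       current altitudes d_i^t, one per pixel block (length n)
    raw:     raw predictions F_i theta (length n)
    prev:    earlier frames; prev[j-1][i] is d_i^{t-j}
    sigma_a: scale of the data term; sigmas[j-1] is sigma_j for lag j
    beta:    weight of the intra-frame L1 term
    """
    n = len(d)
    # first term: agreement with the linear predictions
    data = sum((d[i] - raw[i]) ** 2 for i in range(n)) / sigma_a ** 2
    # second term: temporal smoothness against each earlier frame
    temporal = sum(
        sum((d[i] - frame[i]) ** 2 for i in range(n)) / s ** 2
        for frame, s in zip(prev, sigmas))
    # third term: all blocks of one frame should share a single altitude
    spatial = beta * sum(abs(d[i] - d[j])
                         for i in range(n) for j in range(n) if j != i)
    return data + temporal + spatial
```

A perfectly consistent frame, one that matches its predictions and its history and is flat across blocks, has zero energy. Any disagreement raises the energy and therefore lowers the posterior in (4).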

Figure 4: The MRF model we assume. The schema represents the dependency of the
altitude at pixel block $i$ on other pixel blocks as modeled by the spatio-temporal MRF. In
the schema we show frames at times $t, t-1, \ldots, t-T$. Each solid black dot $d_i^t$ shows the
altitude at patch $i$ at time $t$. The solid lines show the equality constraints we impose on
the altitude predictions, and the dotted lines capture the smoothness across frames.

The parameter $\sigma_a$ ensures smoothness of our raw altitude predictions. In our
experiments, we found heteroskedasticity in the data, i.e., the variance of
the prediction error varied with the texture of the image. To accommodate
this variation, a separate $\sigma_a$ parameter was estimated for each image by projecting the
individual feature vectors $F_i$ of the image patches onto a regression hyperplane $S$ that captures
the expected error. That is, a hyperplane $S$ is first estimated from the training data by linear least
squares over $E[(d_i - F_i\theta)^2] = S^T F_i$. Then, given the feature vectors $F_i$ from a test
image, we calculate $\sigma_a = E[\|S^T F_i\|]$, i.e., by averaging over the individual feature vectors
projected onto $S$.
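Below is a minimal sketch of the per-image $\sigma_a$ estimate described above, read literally. $S$ is fitted by least squares so that $S^T F_i$ predicts the squared training residual $(d_i - F_i\theta)^2$. $\sigma_a$ is then the mean of $|S^T F_i|$ over the blocks of the test image. The helper names are hypothetical. The small linear solver is included only to keep the block self-contained.

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system A x = b."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_error_plane(F, d, theta):
    """Least-squares S such that S^T F_i approximates (d_i - F_i theta)^2 on training data."""
    m = len(theta)
    sq = [(di - sum(f[a] * theta[a] for a in range(m))) ** 2 for f, di in zip(F, d)]
    FtF = [[sum(f[a] * f[b] for f in F) for b in range(m)] for a in range(m)]
    Fts = [sum(f[a] * s for f, s in zip(F, sq)) for a in range(m)]
    return solve_linear(FtF, Fts)


def sigma_a_for_image(F_test, S):
    """sigma_a for one test image: mean over its blocks of |S^T F_i|."""
    return sum(abs(sum(s * x for s, x in zip(S, f))) for f in F_test) / len(F_test)
```

Because each test image has its own feature vectors, textures that were associated with larger training errors give a larger $\sigma_a$. This weakens the data term of (5) for that image.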

A different strategy was used to find the $\sigma_j$ parameters, which capture the
variance of the altitude estimates across frames. We assume that the altitude variations are
smooth and consistent across all the patches in a single image. We estimate
the hyperplane parameter $S_j$ from the eigen-analysis of $(d_i^t - d_i^{t-j})$, where $d_i^t$ is the true
altitude from the training data and $d_i^{t-j}$ represents the estimated altitude of the $i$th pixel block
at time step $t - j$, earlier than the current step $t$. The covariance matrix is formed by
stacking up the prediction errors across patches in columns and the variations across frames
in rows. Finally, $\sigma_j = \|S_j d^{t-j}\|$ is computed. Note that in the eigen-analysis, we assume
that the error in $d^{t-j}$ is normally distributed and $\bar{d} = E(d^{t-j})$ for varying $j$. The parameter
$\beta$ in the model controls how close the predicted altitudes $d_i$ and $d_j$ within a frame are, and
is determined experimentally through cross-validation.

Once the parameters are estimated from the training data, given a test frame, we compute
the Maximum A Posteriori (MAP) estimate of $d$ to obtain the altitude over the entire image.
The MAP problem can be stated as follows:

$$ \min_d E_{\sigma,\theta}(d, F). \qquad (6) $$


To solve this problem, (5) is first reformulated as an optimization problem with L2 and L1
terms through the following linear algebra transformations. Let $D$ be an $n \times 1$ vector built
by stacking up all the altitude values $d$ over the entire image. Let $Y$ also be an $n \times 1$ vector,
created by stacking up the corresponding $F_i\theta$ values. Further, let $E$ and $I$ be $n^2 \times n$ matrices
defined as follows:


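$$
E = \begin{pmatrix}
\mathbf{1}_n & 0 & \cdots & 0 \\
0 & \mathbf{1}_n & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \mathbf{1}_n
\end{pmatrix},
\qquad
I = \begin{pmatrix}
I_{n \times n} \\
I_{n \times n} \\
\vdots \\
I_{n \times n}
\end{pmatrix}
$$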

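where $\mathbf{1}_n$ is a vector of $n$ ones and $I_{n \times n}$ denotes the $n \times n$ identity matrix, so that
the $((i-1)n + j)$th entry of $(E - I)D$ is $d_i - d_j$. Thus (6) in terms of (5) can be rewritten as

$$
\min_D \ \frac{\|D - Y\|_2^2}{\sigma_a^2} + \sum_{j=1}^{T} \frac{\|D - D^{t-j}\|_2^2}{\sigma_j^2} + \beta \|(E - I)D\|_1 \qquad (7)
$$

where $D^{t-j}$ stacks the altitudes of frame $t - j$.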
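The sketch below shows why $(E - I)D$ collects exactly the differences $d_i - d_j$ penalized by the third term of (5). It builds $E$ and $I$ for $n$ blocks as dense lists and applies $E - I$ to $D$. The function names are illustrative. For realistic $n$, one would compute the differences directly rather than form $n^2 \times n$ matrices.

```python
def build_E_I(n):
    """The n^2 x n matrices E and I of the MAP reformulation.

    Row i*n + j of E selects d_i (block i of E is a column of ones in column i);
    row i*n + j of I selects d_j (I stacks n copies of the n x n identity).
    """
    E = [[1.0 if c == i else 0.0 for c in range(n)] for i in range(n) for _ in range(n)]
    I = [[1.0 if c == j else 0.0 for c in range(n)] for _ in range(n) for j in range(n)]
    return E, I


def pairwise_differences(D):
    """(E - I) D, i.e. the stacked values d_i - d_j for all ordered pairs (i, j)."""
    n = len(D)
    E, I = build_E_I(n)
    return [sum((e - i_) * x for e, i_, x in zip(E[r], I[r], D)) for r in range(n * n)]


def l1_smoothness(D, beta):
    """beta * ||(E - I) D||_1, which equals the third term of (5)."""
    return beta * sum(abs(v) for v in pairwise_differences(D))
```

For $D = (1, 4)$ the stacked differences are $(0, -3, 3, 0)$. The diagonal pairs $i = j$ contribute zeros, so the L1 norm matches the double sum over $j \neq i$ in (5).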


Setting $X = (E - I)D$, problem (7) becomes a standard quadratic optimization problem with an
L1 penalty, which can be solved efficiently by a modified version of the feature-sign
search algorithm described in [7]. The basic idea of this algorithm is that if we can infer
the correct signs of the elements of the vector $X$, the L1 minimization problem can
be converted into a standard L2 minimization problem, which can be solved efficiently using
conventional optimization techniques.
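The thesis solves (7) with a modified feature-sign search. A different and well-established alternative for an L1 penalty on a linear transform of $D$ is ADMM with the split $X = (E - I)D$, and the sketch below uses it. It never forms $E - I$ explicitly, because $(E - I)^T (E - I) = 2(nI - \mathbf{1}\mathbf{1}^T)$. The penalty parameter `rho`, the iteration count and all names are assumptions. The pure-Python solver is meant only for small $n$.

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system A x = b."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def soft(v, t):
    """Soft-thresholding: proximal operator of t*|v|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def map_altitudes(Y, prev, sigma_a, sigmas, beta, rho=1.0, iters=3000):
    """Approximately solve (7) by ADMM with the split X = (E - I) D.

    Y:      raw predictions F_i theta for the n blocks of the current frame
    prev:   earlier frames; prev[j-1] is D^{t-j} (a list of n altitudes)
    sigmas: sigmas[j-1] is sigma_j; beta weights the L1 term
    """
    n = len(Y)
    pairs = [(i, j) for i in range(n) for j in range(n)]  # row (i, j) of E - I gives d_i - d_j
    # smooth part of (7) written as 0.5 * q * ||D||^2 - c . D + const
    q = 2.0 / sigma_a ** 2 + sum(2.0 / s ** 2 for s in sigmas)
    c = [2.0 * Y[i] / sigma_a ** 2
         + sum(2.0 * frame[i] / s ** 2 for frame, s in zip(prev, sigmas))
         for i in range(n)]
    # D-update matrix q I + rho (E - I)^T (E - I) = q I + 2 rho (n I - 1 1^T), fixed
    A = [[(q if a == b else 0.0) + 2.0 * rho * ((n if a == b else 0) - 1)
          for b in range(n)] for a in range(n)]
    X = [0.0] * len(pairs)
    U = [0.0] * len(pairs)  # scaled dual variable
    D = list(Y)
    for _ in range(iters):
        # D-update: (q I + rho M^T M) D = c + rho M^T (X - U), with M = E - I
        rhs = list(c)
        for r, (i, j) in enumerate(pairs):
            w = rho * (X[r] - U[r])
            rhs[i] += w
            rhs[j] -= w
        D = solve_linear(A, rhs)
        MD = [D[i] - D[j] for i, j in pairs]
        # X-update: shrink the pairwise differences toward zero
        X = [soft(MD[r] + U[r], beta / rho) for r in range(len(pairs))]
        # dual update
        U = [U[r] + MD[r] - X[r] for r in range(len(pairs))]
    return D
```

Two limiting cases make useful checks. With $\beta = 0$ and no earlier frames, the result reduces to the raw predictions $Y$. A large $\beta$ pulls every block to a common altitude, which is the mean 2.5 for $Y = (1, 2, 3, 4)$.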
4 Experiments

4.1 UAV System

The experiments in this thesis were conducted in a laboratory setting using a Blade CX2
helicopter from E-Flite [8]. Fig. 5 shows the helicopter with the position and orientation of
the video camera. We used an onboard Fyecam 2.4 GHz Color Micro Wireless Video Camera System
[6] that captures NTSC video at 250K pixels and transmits it on a frequency
in the 2.5 GHz spectrum. To track the position of the helicopter during test flights, 6
Vicon high-resolution MX-40 grayscale cameras [16] with a resolution of 2352 × 1728 pixels
were deployed along the perimeter of the lab, as shown in Fig. 6, giving an experimental volume
of 4.5 m × 4.5 m × 2 m. The Vicon cameras send information every 7 ms through high-speed
Ethernet cables to a central router, which collates the data and provides it to a PC running
the Vicon iQ software, which records the true altitude.


Figure 5: The Blade CX2 helicopter that was used for the experiments. It is a coaxial
helicopter with a rotor diameter of 34.5 cm and a height of 18.3 cm, and it weighs approximately
220 g (with battery).
Figure 6: The experimental setup for tracking the position of the helicopter. Six high-resolution
Vicon cameras (only three shown in the picture) were placed along the perimeter
of the lab to track the reflective markers on the helicopter (visible as bright dots in the
picture), thereby determining its position.

4.2 Learning Setup

To learn the basis set, we collected approximately 250 random aerial images (640 × 480) from
the internet. Each image was converted to grayscale and divided into patches of size 10 × 10,
which were vectorized and then used in (2) to build the basis. Fig. 7
shows the mean absolute error in altitude prediction for various choices of
the basis set size. Trading off the accuracy of altitude prediction (as seen in
Fig. 7) against computational time, we settled on 200 basis vectors for our experiments. The
parameters of the MRF model were then estimated by regressing the basis on patches from
a training set of 200 images taken with the camera on the helicopter. These images were of
the kind shown in Fig. 1. Finally, the model was tested on pre-recorded lab flight sessions by
minimizing (7). Each test image in the video sequence was first Wiener-filtered
to reduce motion blur and then convolved with a Gaussian filter for smoothing. To
improve the speed of prediction, we used only the central 100 × 100 section of each image.
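The following sketch covers the preprocessing steps above: grayscale conversion, cropping of the central 100 × 100 region, and cutting of 10 × 10 patches that are then vectorized. It makes three assumptions. Images are nested lists of pixels, patches do not overlap, and the grayscale conversion uses the ITU-R BT.601 luma weights, since the thesis does not say which conversion was used. The Wiener and Gaussian filtering steps are omitted.

```python
def rgb_to_gray(rgb_img):
    """Luma with ITU-R BT.601 weights (an assumed choice of grayscale conversion)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb_img]


def center_crop(img, size=100):
    """Keep the central size x size region of an image stored as a list of rows."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def to_patches(img, patch=10):
    """Split an image into non-overlapping patch x patch blocks, each flattened row-major."""
    h, w = len(img), len(img[0])
    return [[img[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
            for r in range(0, h - patch + 1, patch)
            for c in range(0, w - patch + 1, patch)]
```

A 640 × 480 frame cropped to its central 100 × 100 region yields 100 patches. Each is a vector of length 100, matching the 100 × 1 basis vectors of Fig. 7.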

4.3 Altitude Prediction

All the algorithms were implemented in Matlab. Testing the framework took less than 0.5
seconds per image on a PC with a 2 GHz Pentium processor and 2 GB RAM. Fig. 8 plots
the true and predicted altitudes for various experimental UAV flight sessions. As the plots
show, the predicted altitude gives a very good approximation of the true altitude
most of the time. The prediction error grows as the true altitude of the UAV increases.
This is expected, because at higher altitudes the ground appears almost textureless. Thus our
algorithm is mostly applicable to low-altitude situations, such as landing or low-altitude flight,
unless a more powerful camera is available onboard.

Figure 7: A plot of the mean absolute error of prediction (y axis, in m) against the number
of basis vectors used (x axis). Each basis vector is of size 100 × 1.

4.4 Robustness Evaluation

To evaluate the robustness of our algorithm on non-flat surfaces and across varying ground
textures, we placed boxes with varying textures in the experimental area and flew the helicopter
over them. Fig. 9 shows the results of this experiment. As seen, our algorithm
tracks the altitude variations caused by the obstacles well. We
repeated the experiments with brightly illuminated boxes and with non-textured surfaces;
the outputs are shown in Fig. 10. As expected, in this case the algorithm
confuses textureless surfaces with the textures seen at high altitudes and performs poorly.
Table 1 summarizes the minimum, maximum and mean errors and their
standard deviations for the various experiments.


Experiment       Min error (m)   Max error (m)   Mean error (m)   Std. dev. (m)
Without boxes    0.0             1.13            0.35             0.26
With boxes       0.0             1.56            0.48             0.36

Table 1: A summary of the various flight experiments.
Figure 8: Prediction of altitude for various flight sessions. The x axis shows the video frame
number and the y axis the altitude. The blue curve shows the true altitude, the black
curve the raw output of the linear prediction before MAP estimation, and the red curve
the final output after MAP estimation. The helicopter was flown up and down,
which accounts for the variations in the blue curve.




Figure 9: Prediction accuracy when the ground altitude varies. We placed a few boxes on
the floor and flew the helicopter over them. The plot shows the true altitude of the helicopter
(black), with the height of the boxes subtracted whenever it flew over them, and the predicted
altitude (red). As seen, our altitude prediction is very close to the black curve most of the time.


Figure 10: Prediction accuracy for brightly illuminated surfaces (former) and non-textured
surfaces (latter). The plot shows the true altitude of the helicopter above the boxes (blue) and
the predicted altitude (red). Note the spikes in the predicted altitude where the algorithm
confuses non-textured surfaces with high-altitude surfaces. The spikes appear at positions
where the helicopter actually gets closer to the boxes.
5 Conclusion and Discussion

In this thesis, we investigated the possibility of predicting the altitude of a UAV from
downward-looking image sequences taken by a single onboard camera in an indoor setting.
We found that sparse coding can effectively capture the texture in an image. Supervised
regression using this sparse basis against a training set of altitudes provided a good prediction
setup. A spatio-temporal MRF was then modeled, and its MAP estimate with
respect to the altitude was computed. The effectiveness of our approach was substantiated
through laboratory experiments. We found that once the altitude of the UAV exceeds a
certain height (which depends on the specific environment), the images become textureless
and our algorithm performs poorly. Thus our mechanism is best suited to situations
where the UAV flies at low altitudes and low speeds. Also, since the prediction
is based on the texture variations in the image, the mechanism performs poorly when the
ground surface is textureless. One way to improve our algorithm in such situations would be to
incorporate information from other sensors, such as ultrasonic or infrared sensors, as priors in the MRF
model. Another direction for this work would be to extend our framework to
forward-looking onboard cameras. Such a setup could enable not only altitude estimation
but also visual navigation and obstacle avoidance. These are topics for future research.